1 A load cluster management system using SNMP and web
Myung-Sup Kim, Mi-Joung Choi, James W. Hong

November 2002 International Journal of Network Management, Volume 12 issue 6 
Publisher: John Wiley & Sons, Inc. 

Full text available: pdf (355.47 KB) Additional Information: full citation, abstract, references, index terms

Clustered servers for Internet service is a popular solution to cope with the explosive 
increase in client requests. The high probability of service failure in cluster servers make
the cluster management system necessary to provide high availability and convenient 
administrator control. In this paper, we present the design and implementation of a load 
cluster management system (LCMS) based on SNMP and Web technology. Our LCMS 
implementation has been deployed on a commercial ultra-dense server.



2 Distributed operating systems

Andrew S. Tanenbaum, Robbert Van Renesse 

December 1985 ACM Computing Surveys (CSUR), Volume 17 issue 4 

Publisher: ACM Press 

Full text available: pdf (5.49 MB) Additional Information: full citation, abstract, references, citings, index terms, review



Distributed operating systems have many aspects in common with centralized ones, but 
they also differ in certain ways. This paper is intended as an introduction to distributed 
operating systems, and especially to current university research about them. After a 
discussion of what constitutes a distributed operating system and how it is distinguished 
from a computer network, various key design issues are discussed. Then several 
examples of current research projects are examined in some detail ...



3 Cluster-based scalable network services
Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier 
October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles SOSP '97, volume 31 issue
Publisher: ACM Press

Full text available: pdf (2.42 MB) Additional Information: full citation, references, citings, index terms

4 Emergent (mis)behavior vs. complex software systems
Jeffrey C. Mogul 

April 2006 ACM SIGOPS Operating Systems Review, Proceedings of the 2006
EuroSys conference EuroSys '06, volume 40 issue 4
Publisher: ACM Press 

Full text available: pdf (391.85 KB) Additional Information: full citation, abstract, references, index terms

Complex systems often behave in unexpected ways that are not easily predictable from 
the behavior of their components; this is known as emergent behavior. As software 
systems grow in complexity, interconnectedness, and geographic distribution, we will 
increasingly face unwanted emergent behavior. Unpredictable software systems are hard
to debug and hard to manage. We need better tools and methods for anticipating,
detecting, diagnosing, and ameliorating emergent misbehavior. These tools an ...

Keywords: complex systems, emergent behavior, emergent misbehavior 



5 Towards highly reliable enterprise network services via inference of multi-level
dependencies

Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming 
Zhang 

August 2007 ACM SIGCOMM Computer Communication Review, Proceedings of the
2007 conference on Applications, technologies, architectures, and 
protocols for computer communications SIGCOMM '07, volume 37 issue 4 

Publisher: ACM Press 

Full text available: pdf (679.67 KB) Additional Information: full citation, abstract, references, index terms

Localizing the sources of performance problems in large enterprise networks is extremely 
challenging. Dependencies are numerous, complex and inherently multi-level, spanning 
hardware and software components across the network and the computing infrastructure. 
To exploit these dependencies for fast, accurate problem localization, we introduce an 
Inference Graph model, which is well-adapted to user-perceptible problems rooted in 
conditions giving rise to both partial service degradation ... 

Keywords: dependencies, fault localization, network and service management, 
probabilistic inference 




6 Capturing, indexing, clustering, and retrieving system history
Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox
October 2005 ACM SIGOPS Operating Systems Review, Proceedings of the twentieth
ACM symposium on Operating systems principles SOSP '05, volume 39 issue 5

Publisher: ACM Press 

Full text available: pdf (516.41 KB) Additional Information: full citation, abstract, references, index terms

We present a method for automatically extracting from a running system an indexable 
signature that distills the essential characteristic from a system state and that can be 
subjected to automated clustering and similarity-based retrieval to identify when an 
observed system state is similar to a previously-observed state. This allows operators to 
identify and quantify the frequency of recurrent problems, to leverage previous diagnostic 
efforts, and to establish whether problems seen at dif ... 

Keywords: bayesian networks, clustering, information retrieval, performance objectives, 
signatures

7 Scalability in MMOGs: Load balancing for massively multiplayer online games
Fengyun Lu, Simon Parkin, Graham Morgan 

October 2006 Proceedings of 5th ACM SIGCOMM workshop on Network and system
support for games NetGames '06
Publisher: ACM Press 

Full text available: pdf (544.67 KB) Additional Information: full citation, abstract, references, index terms

Supporting thousands, possibly hundreds of thousands, of players is a requirement that
must be satisfied when delivering server based online gaming as a commercial concern. 
Such a requirement may be satisfied by utilising the cumulative processing resources 
afforded by a cluster of servers. Clustering of servers allow great flexibility, as the game 
provider may add servers to satisfy an increase in processing demands, more players, or 
remove servers for routine maintenance or upgrading. If ca ... 

8 Cases from the field: Field studies of computer system administrators: analysis of
system management tools and practices
Rob Barrett, Eser Kandogan, Paul P. Maglio, Eben M. Haber, Leila A. Takayama, Madhu
Prabaker

November 2004 Proceedings of the 2004 ACM conference on Computer supported 
cooperative work CSCW '04 

Publisher: ACM Press 

Full text available: pdf (405.09 KB) Additional Information: full citation, abstract, references, index terms

Computer system administrators are the unsung heroes of the information age, working 
behind the scenes to configure, maintain, and troubleshoot the computer infrastructure 
that underlies much of modern life. However, little can be found in the literature about the 
practices and problems of these highly specialized computer users. We conducted a series
of field studies in large corporate data centers, observing organizations, work practices, 
tools, and problem-solving strategies of system admi ... 

Keywords: collaboration, command-line interfaces, ethnography, situation awareness, 
system administration 



9 A Self Manageable Infrastructure for Supporting Web-based Simulations 
Yingping Huang, Xiaorong Xiang, Gregory Madey 

April 2004 Proceedings of the 37th annual symposium on Simulation ANSS '04 
Publisher: IEEE Computer Society 

Full text available: pdf (574.08 KB) Additional Information: full citation, abstract, index terms

In this paper, we describe the design and implementation of a self-manageable
multi-tiered infrastructure to support web-based scientific simulations. This
infrastructure demonstrates not only the successful integration of Web servers, simulation
servers, database servers, reports servers, data warehousing and mining, but also the
ability to achieve self manageability: self-configuring, self-healing, self-protecting and
self-optimizing. A scientific simulation program, NOMSIM (Natural Organic Matter Simu ...

10 Automatic configuration of internet services
Wei Zheng, Ricardo Bianchini, Thu D. Nguyen 

March 2007 ACM SIGOPS Operating Systems Review, Proceedings of the 2007
conference on EuroSys EuroSys '07, volume 41 issue 3
Publisher: ACM Press 

Full text available: pdf (935.81 KB) Additional Information: full citation, abstract, references, index terms

Recent research has found that operators frequently misconfigure Internet services, 
causing various availability and performance problems. In this paper, we propose a
software infrastructure that eliminates several types of misconfiguration by automating
the generation of configuration files in Internet services, even as the services evolve. The 
infrastructure comprises a custom scripting language, configuration file templates, 
communicating runtime monitors, and heuristic algorithms to detec ... 

Keywords: configuration, internet services, manageability, operator mistakes 



11 Operating and runtime systems for high-end computing systems: Performance
evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++
Gengbin Zheng, Chao Huang, Laxmikant V. Kale

April 2006 ACM SIGOPS Operating Systems Review, volume 40 issue 2 

Publisher: ACM Press 

Full text available: pdf (696.92 KB) Additional Information: full citation, abstract, references, index terms

As the size of high performance clusters multiplies, the probability of system failure grows 
substantially, posing an increasingly significant challenge for scalability. Checkpoint-based 
fault tolerance methods are effective approaches at dealing with faults. With these 
methods, the state of the entire parallel application is checkpointed to reliable storage. 
When a fault occurs, the application is restarted from a recent checkpoint. However, the 
application developer is required to write signif ... 

12 Network engineering: Developing a functional TCP/IP stack oriented towards TCP
connection replication
Javier Paris, Alberto Valderruten, Victor M. Gulias

October 2005 Proceedings of the 3rd international IFIP/ACM Latin American
conference on Networking LANC '05
Publisher: ACM Press 

Full text available: pdf (463.91 KB) Additional Information: full citation, abstract, references

Functional languages are not often associated with the development of network stacks,
mainly due to the lower performance and lack of support for system programming than
more conventional languages such as C. However, there are functional languages that
offer features which make it easier to develop network protocols than using a more
conventional approach based on an imperative language. Erlang, for instance, offers
support for distribution, concurrency and soft real time built-in into the lang ...

13 The process group approach to reliable distributed computing
Kenneth P. Birman

December 1993 Communications of the ACM, Volume 36 issue 12 

Publisher: ACM Press 

Full text available: pdf (6.00 MB) Additional Information: full citation, references, citings, index terms



Keywords: fault-tolerant process groups, message ordering, multicast communication 



14 A quantitative analysis of cache policies for scalable network file systems
Michael D. Dahlin, Clifford J. Mather, Randolph Y. Wang, Thomas E. Anderson, David A.
Patterson

May 1994 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
1994 ACM SIGMETRICS conference on Measurement and modeling of 
computer systems SIGMETRICS '94, Volume 22 issue 1 

Publisher: ACM Press 

Full text available: pdf (1.42 MB) Additional Information: full citation, abstract, references, citings, index terms

Current network file system protocols rely heavily on a central server to coordinate file 
activity among client workstations. This central server can become a bottleneck that limits 
scalability for environments with large numbers of clients. In central server systems such 
as NFS and AFS, all client writes, cache misses, and coherence messages are handled by 
the server. To keep up with this workload, expensive server machines are needed, 
configured with high-performance CPUs, memory systems, ... 

15 Fast restoration of real-time communication service from component failures in
multi-hop networks
Seungjae Han, Kang G. Shin

October 1997 ACM SIGCOMM Computer Communication Review, Proceedings of the
ACM SIGCOMM '97 conference on Applications, technologies,
architectures, and protocols for computer communication SIGCOMM
'97, Volume 27 issue 4
Publisher: ACM Press 

Full text available: pdf (1.96 MB) Additional Information: full citation, abstract, references, citings, index terms

For many applications it is important to provide communication services with guaranteed 
timeliness and fault-tolerance at an acceptable level of overhead. In this paper, we 
present a scheme for restoring real-time channels, each with guaranteed timeliness, from 
component failures in multi-hop networks. To ensure fast/guaranteed recovery, backup 
channels are set up a priori in addition to each primary channel. That is, a dependable 
real-time connection consists of a pr ... 

16 Agility and Experimentation: Practical Techniques for Resolving Architectural
Tradeoffs

T. C. Nicholas Graham, Rick Kazman, Chris Walmsley 

May 2007 Proceedings of the 29th International Conference on Software
Engineering ICSE '07
Publisher: IEEE Computer Society 

Full text available: pdf (344.07 KB) Additional Information: full citation, abstract, index terms

This paper outlines our experiences with making architectural tradeoffs between 
performance, availability, security, and usability, in light of stringent cost and time-to- 
market constraints, in an industrial web-conferencing system. We highlight the difficulties 
in anticipating future architectural requirements and tradeoffs and the value of using 
agility and experiments as a tool for mitigating architectural risks in situations when up 
front pen-and-paper analysis is simply impossible.

17 Practical byzantine fault tolerance and proactive recovery
Miguel Castro, Barbara Liskov 

November 2002 ACM Transactions on Computer Systems (TOCS), volume 20 issue 4 
Publisher: ACM Press 

Full text available: pdf (1.63 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Our growing reliance on online services accessible on the Internet demands highly
available systems that provide correct service without interruptions. Software bugs,
operator mistakes, and malicious attacks are a major cause of service interruptions and
they can cause arbitrary behavior, that is, Byzantine faults. This article describes a new 
replication algorithm, BFT, that can be used to build highly available systems that tolerate 
Byzantine faults. BFT can be used in practice to implement re ... 

Keywords: Byzantine fault tolerance, asynchronous systems, proactive recovery, state

18 Service infrastructure and network management: MobiDesk: mobile virtual desktop
computing
Ricardo A. Baratto, Shaya Potter, Gong Su, Jason Nieh

September 2004 Proceedings of the 10th annual international conference on Mobile 
computing and networking MobiCom '04 

Publisher: ACM Press 

Full text available: pdf (580.39 KB) Additional Information: full citation, abstract, references, citings, index terms

We present MobiDesk, a mobile virtual desktop computing hosting infrastructure that 
leverages continued improvements in network speed, cost, and ubiquity to address the 
complexity, cost, and mobility limitations of today's personal computing infrastructure. 
MobiDesk transparently virtualizes a user's computing session by abstracting underlying 
system resources in three key areas: display, operating system, and network. It provides 
a thin virtualization layer that decouples a user's computing ses ... 

Keywords: computer utility, network mobility, on-demand computing, process migration, 
thin-client computing, virtualization 



19 Session ... support for multimedia: Cost-effective streaming server
implementation using Hi-tactix
Damien Le Moal, Tadashi Takeuchi, Tadaaki Bandoh

December 2002 Proceedings of the tenth ACM international conference on Multimedia 
MULTIMEDIA '02 

Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

High performance and high quality for continuous media stream delivery needed by 
streaming server systems cannot be achieved efficiently using general-purpose operating 
systems, due to the overhead of the I/O mechanism implementation generally used. 
Special OS combined with powerful hardware can deliver better performance and quality 
but increases development complexity and deployment costs. The External I/O Engine 
Architecture adopts a hybrid approach, implementing streaming engines using the s ... 

Keywords: audio/video streaming, operating system, quicktime, real-time 



20 Heuristic methods for dynamic load balancing in a message-passing supercomputer
Jian Xu, Kai Hwang 

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 
Supercomputing '90 

Publisher: IEEE Computer Society 

Full text available: pdf (1.04 MB) Additional Information: full citation, abstract, references

In this paper, a new adaptive scheme is presented for dynamic load balancing on a
message-passing multicomputer. The scheme is based on using easy-to-implement
heuristics and variable threshold in migrating processes among the multicomputer nodes. 
It uses a distributed control over all processor nodes as coordinated by a host processor. 
Four heuristic methods for process migration are presented, which are distinguished by 
choosing different policies for process migration and threshold update. A ...